Abstract

Annotated speech corpora are databases 
consisting of signal data along with time-aligned 
symbolic 'transcriptions'. Such databases are 
typically multidimensional, heterogeneous and 
dynamic. These properties present a number 
of tough challenges for representation and 
query. The temporal nature of the data adds 
an additional layer of complexity. This paper 
presents and harmonises two independent efforts 
to model annotated speech databases, one at 
Macquarie University and one at the University 
of Pennsylvania. Various query languages are 
described, along with illustrative applications to 
a variety of analytical problems. The research 
reported here forms a part of several ongoing 
projects to develop platform-independent
open-source tools for creating, browsing, searching,
querying and transforming linguistic databases, 
and to disseminate large linguistic databases over 
the internet. 

1. Databases of Annotated Speech 
Recordings 

Annotated corpora have been an essential
component of research and development in
language-related technologies for some years.
Text corpora have been used for developing
information retrieval and summarisation software
(e.g. MUC, TREC [14]), automatic taggers and
parsers, and machine translation systems. In a
similar way, annotated speech corpora have
proliferated and have found uses across a rapidly
expanding set of languages, disciplines and
technologies [www.ldc.upenn.edu/annotation/].



Over the last 7 years, the Linguistic Data
Consortium (LDC) has published over 150 text
and speech databases
[www.ldc.upenn.edu/Catalog/].

Typically, such databases are specified at the 
level of file formats. Linguistic content is anno- 
tated with a variety of tags, attributes and values, 
with a specified syntax and semantics. Tools are 
developed for each new format and linguistic do- 
main on an ad hoc basis. These systems are akin 
to the databases of the 1960s. There is a physical 
representation along with a hand-crafted program 
offering a single view on the data. Recently, the
authors have shown how the three-level
architecture and the relational model can be
applied to annotated speech databases. The goal
of this paper is to illustrate our two approaches
and to describe ongoing research on query algebras.

Before presenting the models we give an ex- 
ample of a collection of speech annotations. This 
illustrates the diversity of the physical formats 
and gives an idea of the challenge involved in 
providing a general-purpose logical
characterisation of the data. The Boston University
Radio Speech Corpus consists of 7 hours of radio
news stories
[www.ldc.upenn.edu/Catalog/LDC96S36.html].
The annotations include four
types of information: orthographic transcripts, 
broad phonetic transcripts (including main word 
stress), and two kinds of prosodic annotation, 
all time-aligned to the digital audio files. The 
two kinds of prosodic annotation implement 
the system known as ToBI - Tones and Break
Indices [www.ling.ohio-state.edu/phonetics/E_ToBI/].
We have added three further annotations:
coreference annotation and named entity
annotation in the style of MUC-7
[www.muc.saic.com/proceedings/muc_7_toc.html],
and syntactic structures in the style of the
Penn Treebank [10]. Fragments of the physical
data are shown in Figure 1.

Coreference annotation (Figure 1, top left)
associates a unique identifier to each noun 
phrase and a reference attribute which links each 
pronoun to its antecedent. The set of coreferring 
expressions is considered to be an equivalence 
class. Named-entity annotation (top centre) 
identifies and classifies numerical and name 
expressions. Penn Treebank annotation provides a 
syntactic parse of each sentence. The word-level 
annotation (bottom left) gives the end time of each
word (an offset in seconds into the associated
signal data). The syllable annotation gives the
Arpabet phonetic symbols (see
[www.ldc.upenn.edu/doc/timit/phoncode.doc]). The
tonal annotation provides time points and into- 
national units, and the part of speech annotation 
(bottom right) specifies the syntactic category 
of each word. This is but a small sample of the 
bazaar of data formats. 

2. Data Models for Speech Databases 

Two database models for multi-layered speech 
annotations have been developed by the authors. 
The Emu model (Macquarie) organises the data 
primarily in terms of its hierarchical structure, 
while the annotation graph model (Penn) fore- 
grounds the temporal structure. In separate work 
we demonstrate the expressive equivalence of the 
two models. Here we give a brief overview
of both models. In the remainder of this paper 
we will consider mainly the annotation graph 
data model, while the Emu system serves as an 
example of a working speech database system. 

2.1. The Emu model 

The Emu speech database system
[www.shlrc.mq.edu.au/emu] provides tools for
creation, query and analysis of data from
annotated speech databases. Emu is implemented
as a core C++ library and a set of extensions to
the Tcl scripting language which provide a set of
basic operations on speech annotations. Emu
provides a flexible annotation model into which a
number of existing label file formats can be read.

The Emu annotation model is based on a set of
levels which represent different types of
linguistic data such as words, phonemes or pitch
events. Each level contains a set of tokens which
have one or more labels and optionally a start and
end time relative to an associated speech signal.
Within a level, tokens are stored as a partial
order representing their sequence in the
annotation: each token may have zero or more
previous and next tokens. The partial ordering
must respect timing information if it is present in
the tokens: that is, a token cannot follow a token
with a later start time.

Within and between levels, tokens may be
related by either domination or association
relations. Domination relations relate a parent
token to an ordered sequence of constituent child
tokens and imply that the start and end times of
the parent could be inferred from those of the
children. Association relations have no in-built
semantics and can be used for any
application-specific relation, such as that between
a word and a tone target which denotes the point
at which word stress is realised (Figure 2).
Relations may be defined between any pair of
levels, which allows Emu to handle intersecting
hierarchies such as that illustrated in Figure 2.
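
As a rough illustration of the token model just described, the following Python sketch represents tokens with optional times, infers the times of a parent from the children it dominates, and checks the timing constraint on the sequence order. The class `Token`, the function `respects_timing` and the times shown are hypothetical choices made for this example; they are not part of the Emu library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One token on a level (hypothetical class, not Emu's own API)."""
    level: str
    label: str
    start: Optional[float] = None   # seconds; optional, as in the model
    end: Optional[float] = None
    children: List["Token"] = field(default_factory=list)  # domination

    def span(self) -> Optional[Tuple[float, float]]:
        """Return (start, end), inferring missing times from the children."""
        if self.start is not None and self.end is not None:
            return (self.start, self.end)
        spans = [s for s in (c.span() for c in self.children) if s is not None]
        if not spans:
            return None
        return (min(s[0] for s in spans), max(s[1] for s in spans))


def respects_timing(prev: Token, nxt: Token) -> bool:
    """A token cannot follow a token with a later start time."""
    if prev.start is None or nxt.start is None:
        return True  # the constraint only applies when times are present
    return prev.start <= nxt.start


# Made-up times for the phonemes of one word.
phonemes = [
    Token("Phoneme", "d", 0.50, 0.55),
    Token("Phoneme", "aa", 0.55, 0.70),
    Token("Phoneme", "k", 0.70, 0.78),
]
word = Token("Word", "dark", children=phonemes)
```

Here `word.span()` derives the word's times from its phonemes, which mirrors the way time information at the lowest level is used to derive times for higher levels.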

2.2. The annotation graph model 

A second general-purpose model supporting
multiple independent hierarchical transcriptions
of the same signal data is known as the
annotation graph. This model forms the heart of a
joint initiative between LDC, NIST [www.nist.gov]
and MITRE [www.mitre.org] to develop an
architecture and tools for linguistic analysis
systems (ATLAS), and an NSF-sponsored project
between LDC, the Penn database group, and the
CMU Psychology and Informedia departments, to
develop a multimodal database of communicative
interaction called Talkbank [www.talkbank.org].

Annotation graphs are labelled DAGs with time
references on some of the nodes. Bird and
Liberman have demonstrated that annotation
graphs are sufficiently expressive to encompass
the full range of current speech annotation
practice. A simple example of an annotation graph
is shown in Figure 3 for a corpus known as TIMIT.
Annotation graphs (AGs) have the following
structure. Let $L$ be the label data, drawn from
the label sets $L_i$, which occurs on the arcs of
an AG. The nodes $N$ of an AG reference signal
data by virtue of a function mapping nodes to
time offsets $T$. AGs are now defined as follows:

Definition 1 An annotation graph $G$ over a label
set $L$ and a timeline $T$ is a 3-tuple
$(N, A, \tau)$ consisting of a node set $N$, a
collection of arcs $A$ labelled with elements of
$L$, and a time function $\tau$, which satisfies
the following conditions:



Coreference Annotation

<COREF ID="2" MIN="woman">
This woman</COREF>
receives three hundred dollars
a month under
<COREF ID="5">
General Relief</COREF>, plus
<COREF ID="16"
MIN="four hundred dollars">
four hundred dollars a month in
<COREF ID="17"
MIN="benefits" REF="16">
A.F.D.C. benefits</COREF>
</COREF> for
<COREF ID="9" MIN="son">
<COREF ID="3" REF="2">
her</COREF> son
</COREF>, who is
<COREF ID="10" MIN="citizen" REF="9">
a U.S. citizen</COREF>.
<COREF ID="4" REF="2">
She</COREF>'s among
<COREF ID="18" MIN="aliens">
an estimated five hundred illegal
aliens on
<COREF ID="6" REF="5">
General Relief</COREF>
out of
<COREF ID="11" MIN="population">
<COREF ID="13" MIN="state">
the state</COREF>'s
total illegal immigrant
population of
<COREF ID="12" REF="11">
one hundred thousand
</COREF>
</COREF>
</COREF>.
<COREF ID="7" REF="5">
General Relief</COREF>
is for needy families and
unemployable adults who



Named Entity Annotation

<b_numex TYPE="MONEY">
three hundred dollars
a month under General
Relief, plus
<b_numex TYPE="MONEY">
four hundred dollars
a month in A.F.D.C.
benefits for her

U.S.
citizen, brth She's among
an estimated five hundred
illegal aliens on General
Relief brth out of the
state's total illegal
immigrant population of
one hundred thousand, brth
General Relief is for
needy families and
unemployable adults brth
who don't qualify for other
public assistance, brth
<b_enamex TYPE="ORGANIZATION">
Welfare Department
<b_enamex TYPE="PERSON">
Michael Reganburg
brth says the state will
save about
<b_numex TYPE="MONEY">
one million dollars
<e_numex>
a year if illegal aliens
are denied General Relief.



Penn Treebank Annotation

(NP-SBJ This woman)
(VP receives
  (NP
    (NP
      (NP (QP three hundred) dollars)
      (NP-ADV a month)
      (PP under
        (NP General Relief))) , plus
    (NP
      (NP (QP four hundred) dollars)
      (NP-ADV a month)
      (PP in
        (NP A.F.D.C. benefits))))
  (PP for
    (NP
      (NP her son) ,
      (SBAR (WHNP-1 who)
        (S (NP-SBJ *T*-1)
          (VP is
            (NP-PRD a U.S. citizen)))))))

( (S
  (NP-SBJ She)
  (VP 's
    (PP-PRD among
      (NP (NP an estimated
            (QP five hundred) illegal aliens)
        (PP on
          (NP General Relief))
        (PP out of
          (NP
            (NP
              (NP the state 's)
              total illegal immigrant population)
            (PP of
              (NP
                (QP one hundred thousand))))))))
Figure 1. Multiple Annotations of the Boston University Radio Speech Corpus




Figure 2. An example utterance from the TIMIT database which has been augmented
with both a syntactic annotation and a ToBI style intonational annotation. The names
of the levels are shown on the left; the Word level has been duplicated to show the
links to both the syntactic and intonational hierarchies. The single Tone event H*
is associated with the word 'dark'. Time information at the phoneme level is used to
derive times for all higher levels.



1. $(N, A)$ is an acyclic digraph labeled with
elements of $L$, and containing no nodes of
degree zero;

2. $\tau : N \rightharpoonup T$, such that, for any
path from node $n_1$ to $n_2$ in $A$, if
$\tau(n_1)$ and $\tau(n_2)$ are defined, then
$\tau(n_1) \le \tau(n_2)$.

Note that AGs may be disconnected or empty,
and that they must not have orphan nodes. The
AG corresponding to the Emu annotation structure
in Figure 2, for the first five words of a TIMIT
annotation, is given in Figure 3. The arc types are
interpreted as follows: S - syntax; W - word; P
- phoneme; T - tone; Imt - intermediate phrase;
Itl - intonational phrase.
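
The two conditions of Definition 1 can be checked mechanically. The Python sketch below is one possible way to do so, under our own assumptions about representation: arcs are `(source, target, label)` triples, the time function is a dictionary defined on only some nodes, and the name `is_annotation_graph` is hypothetical.

```python
def is_annotation_graph(nodes, arcs, tau):
    """Check the two conditions of Definition 1.

    nodes: iterable of node ids
    arcs:  list of (source, target, label) triples
    tau:   dict mapping some of the nodes to time offsets (partial function)
    """
    nodes = set(nodes)
    # Condition 1 (first part): no nodes of degree zero, and every arc
    # endpoint is a declared node.
    if nodes != {n for (src, dst, _) in arcs for n in (src, dst)}:
        return False

    succ = {n: set() for n in nodes}
    for src, dst, _ in arcs:
        succ[src].add(dst)

    # Condition 1 (second part): acyclicity, by depth-first search.
    # reach[n] collects every node reachable from n by a non-empty path.
    reach, state = {}, {}

    def visit(n):
        state[n] = "open"
        found = set()
        for m in succ[n]:
            if state.get(m) == "open":
                raise ValueError("cycle")
            if m not in state:
                visit(m)
            found |= {m} | reach[m]
        state[n] = "done"
        reach[n] = found

    try:
        for n in nodes:
            if n not in state:
                visit(n)
    except ValueError:
        return False

    # Condition 2: times never decrease along a path, wherever both
    # endpoints of the path carry a time.
    return all(tau[n] <= tau[m]
               for n in nodes for m in reach[n]
               if n in tau and m in tau)
```

The empty graph passes the check, in line with the remark that AGs may be empty, while a graph with an unattached node is rejected.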

3. Annotations as Relational Tables 

Annotation data expressed in either the Emu 
or annotation graph data models can be trivially 
recast as a set of relational tables. For the
purposes of this paper it is instructive to consider 
the relational form of annotation data in order to 
explore the requirements for a query language for 
these databases. 



An annotation graph can be represented as a
pair of tables, for the arc and time relations.
The arc relation is a six-tuple containing an arc
id, a source node id, a target node id, and three
labels taken from the sets $L_1$, $L_2$, $L_3$
respectively. The choice of three label positions
is somewhat arbitrary, but it seems to be both
necessary and sufficient for the various
annotation structures considered here.

We let $L_1$ be the set of types of transcript
information (e.g. 'word', 'syllable', 'phoneme'),
and let $L_2$ be the substantive transcript
element (e.g. particular words, phonetic symbols,
and so on). We let $L_3$ be the names of
equivalence classes, used here to model so-called
'phonological association'.
(This kind of association is discussed in
depth in [jl[].) Let T be the set of non-negative 
integers, the sample numbers. Figure 4 gives an
example for the TIMIT data of Figures 2 and 3.
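
To make the relational form concrete, the sketch below stores an arc relation and a time relation in an in-memory SQLite database and joins them to recover start and end times for word arcs. The table and column names (`arc`, `node_time`, `l1` and so on) are our own illustrative choices, and the rows are made-up sample values rather than the contents of the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE arc (
        id  INTEGER PRIMARY KEY,
        src INTEGER NOT NULL,   -- source node id
        dst INTEGER NOT NULL,   -- target node id
        l1  TEXT,               -- type of transcript information
        l2  TEXT,               -- substantive transcript element
        l3  TEXT                -- equivalence class name, if any
    );
    CREATE TABLE node_time (
        node INTEGER PRIMARY KEY,
        t    INTEGER NOT NULL   -- sample number
    );
""")

# Illustrative rows only (hypothetical node ids and sample numbers).
conn.executemany("INSERT INTO arc VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 0, 1, "W", "she", None),
    (2, 1, 2, "W", "had", None),
    (3, 0, 1, "P", "sh", None),
])
conn.executemany("INSERT INTO node_time VALUES (?, ?)", [
    (0, 0), (1, 2000), (2, 5000),
])

# Join each word arc with the times of its source and target nodes.
word_times = conn.execute("""
    SELECT a.l2, s.t, e.t
    FROM arc AS a
    JOIN node_time AS s ON s.node = a.src
    JOIN node_time AS e ON e.node = a.dst
    WHERE a.l1 = 'W'
    ORDER BY s.t
""").fetchall()
```

Because the time relation is kept separate, nodes without a time reference simply have no row in `node_time`, and arcs touching them drop out of this particular join.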

We form the transitive closure of the
(unlabeled) graph relation to define a structural
(graph-wise) precedence relation using a datalog
program:

% reflexive case; node/1 is defined in Section 4.2
s_prec(X, X) :- node(X).
s_prec(X, Y) :- arc(_, X, Y, _, _, _).
s_prec(X, Y) :- s_prec(X, Z),
                arc(_, Z, Y, _, _, _).
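
For readers less familiar with datalog, the same reflexive and transitive closure can be computed by a naive fixpoint iteration. The following Python sketch assumes arcs are six-tuples laid out like the arc relation; the function name mirrors the predicate, but the code is only an illustration and not part of either system.

```python
def s_prec(arcs):
    """Return the set of node pairs (x, y) such that x structurally precedes y.

    arcs: list of (arc_id, source, target, l1, l2, l3) tuples.
    """
    nodes = {n for (_, src, dst, _, _, _) in arcs for n in (src, dst)}
    # Reflexive pairs plus one pair per arc.
    closure = {(n, n) for n in nodes}
    closure |= {(src, dst) for (_, src, dst, _, _, _) in arcs}
    # Extend known pairs by one arc until nothing new is derived.
    while True:
        step = {(x, dst)
                for (x, z) in closure
                for (_, src, dst, _, _, _) in arcs
                if src == z}
        if step <= closure:
            return closure
        closure |= step
```

The loop terminates because the closure can only grow and is bounded by the number of node pairs.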




Figure 3. TIMIT Graph Structure 
Figure 4. The Arc and Time Relations



Now we further define a temporal precedence
relation, where leq is the $\le$ relation
(minimally defined on the times used by the graph):

t_prec0(X, Y) :- time(X, T1),
                 time(Y, T2),
                 leq(T1, T2).

t_prec(X, Y) :- t_prec0(X, Y).
t_prec(X, Y) :- t_prec(X, Z),
                t_prec0(Z, Y).

4. Exploring Annotated Linguistic 
Databases 

4.1. General architecture 

In our experience with the analysis of linguistic 
databases, we have found a recurrent pattern of use 
having three components which we will call query, 
report generation, and analysis. 

The query system proper can be viewed as a 
function from annotation graphs to sets of sub- 
graphs, i.e. those meeting some (perhaps complex) 
condition. The report generation phase is able 
to access these query results, but also the signals 



underlying the annotations. For example, the
report generation phase can calculate such things
as 'mean F2 in signal S during time interval
$(t_1, t_2)$'. Each hit constitutes an
'observation' in the statistician's sense, and we
extract a vector of specified values for each
observation, to be passed along to the analysis
system. The analysis phase is then some
general-purpose data crunching system such as
Splus or Matlab.

This architecture saves us from having to in- 
corporate all possible calculations over annotated 
signals into the query language. The report gener- 
ation phase can perform such calculations, as well 
as compute properties of the annotation data itself. 
This seems to simplify the query system a good 
deal; now things like 'count the number of sylla- 
bles to the end of the current phrase' (which we 
do need to be able to do) are tasks for the report 
generator, not the query system proper. 

In general, the result of a query is a set of
sub-graphs, each of which forms one matching
instance. If we use the relational model proposed
above, these would be returned as a result table
having the same structure as the arc relation of
Figure 4, but containing just the tuples which
took part in each matching instance. We are then
faced with the problem of how to differentiate the
matching instances. For example, if we wished to
collect together the word labels for the query
'find all words dominated by noun phrases', we
would need some way of treating each sub-graph
separately. Hence, we would prefer the result to
be a set of tables rather than a single table
containing all matching tuples.

In a sense, then, the only role of the query is to 
define an iterator for the report generator over a set 
of sub-graphs of the overall annotation graph. 
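
The idea of a query as an iterator over matching instances can be sketched with a Python generator that yields one small result per noun phrase, rather than a single flat table. This is a simplified illustration: it assumes that node ids increase along the timeline, so that containment can be tested by comparing ids, and the type names `S` and `W`, the function name and the arcs are all made up for the example.

```python
def words_by_np(arcs):
    """Yield (np_arc_id, [word labels]) for each noun phrase arc in turn.

    arcs: list of (arc_id, source, target, l1, l2, l3) tuples.
    Assumption: node ids are integers that increase along the timeline.
    """
    for np in arcs:
        if np[3] == "S" and np[4] == "np":
            words = [a[4] for a in arcs
                     if a[3] == "W" and np[1] <= a[1] and a[2] <= np[2]]
            yield np[0], words


# Hypothetical arcs for 'she had your dark suit'.
example_arcs = [
    (1, 0, 1, "W", "she", None),
    (2, 1, 2, "W", "had", None),
    (3, 2, 3, "W", "your", None),
    (4, 3, 4, "W", "dark", None),
    (5, 4, 5, "W", "suit", None),
    (6, 0, 1, "S", "np", None),
    (7, 2, 5, "S", "np", None),
]

# A report generator would consume one matching instance at a time.
instances = list(words_by_np(example_arcs))
```

Each yielded pair keeps the words of one noun phrase together, which is the separation between instances that a single flat result table would lose.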

The Emu query language 

The Emu query language uses simple con- 
ditions on token labels which match only 
tokens at a specified level, for example: 
Phonetic=A | I | | U | E | V. These conditions 
can be combined by sequence, domination or 
association operators to constrain the relational 
structure of the tokens of interest. Examples of 
each are: 

Find a sequence of vowel followed by stop
at the phoneme level:

[Phoneme=vowel -> Phoneme=stop]

Find words not labelled x dominating vowel
phonemes:

[Word!=x ^ Phoneme=vowel]

Find words associated with H* tones:

[Word!=x => Tone=H*]

The Word!=x query is intended to match any
word, in lieu of a query language construct which
allows matching any label string.

Each query matches either a token or, in the
case of the sequence query, a sequence of tokens.
The result of a domination or association query is
the result of the left hand side of the bracketed
term; this can be changed by marking the right
hand side term with a hash (#). Compound queries
can be arbitrarily nested to specify complex
constraints on tokens. As an example, the
following query finds sequences of stop and vowels
dominated by strong syllables where the vowel is
associated with an H* tone target. The result is a
list of the vowel labels with associated start and
end times.

[Syllable=S ^
  [Phoneme=stop ->
    [Phoneme=vowel => Tone=H*]]]

The result of an Emu query is a table with one 
entry per matching token: 

database: timit
query: Phoneme!=x
type: segment

This table is used to extract any of the
time-series data associated with the database, an
operation usually carried out from an analysis
environment such as Splus or XlispStat. Emu
provides libraries of analysis functions for these
environments which facilitate, for example,
mapping signal processing operations over each
token in a query result or overlaying plots of the
time series data for each token.

Although this query system has proved useful
and usable in the environment of acoustic
phonetics research, it is now evident that there
are a number of shortcomings which prevent its
wider use. The query syntax is unable to express
some queries, such as those involving disjunction
or optional elements, and the query result is only
really useful for data extraction. It is for these
reasons that we are now looking more formally at
the requirements for a query language for
annotation data.

4.2. A query language on annotation graphs 

A high-level query language for annotation 
graphs, founded on an interval-based tense logic, 
is currently being developed and will be reported 
in a later version of this paper. 

Here we describe a variety of useful queries on 
annotation graphs and formulate them as datalog 
programs. As we shall see, it turns out that datalog 
is insufficiently expressive for the range of queries 
we have in mind. Finding a more expressive yet 
tractable query language is the focus of ongoing 
research. 

A number of simple operations, extending our
two relations arc/6 and time/2, will be necessary
for succinct queries. The first and most obvious
is for hierarchy. Observe in Figure 3 that there
is a notion of structural inclusion defined by the
arcs. We formulate this as follows:

s_incl(I, J) :-
    arc(I, W, Z, _, _, _),
    arc(J, X, Y, _, _, _),
    s_prec(W, X), s_prec(Y, Z).

Now, since s_prec is reflexive, so is s_incl.
Observe that nodes 3 and 6 in Figure 3 are
connected by both an S/v arc and a W/had arc. The
syntactic verb arc S/v should dominate the word
arc W/had, but not vice versa. Therefore we need
to have a hierarchy defined over the types. We
achieve this with a (domain-specific) ordering on
the type names:

type_hierarchy(word, syl).
type_hierarchy(syl, seg).

Now dominance is expressed by the predicate:

dom(I, J) :-
    arc(I, _, _, L1, _, _),
    arc(J, _, _, L2, _, _),
    type_hierarchy(L1, L2),
    s_incl(I, J).
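
The interaction between structural inclusion and the type ordering can be seen in a small Python sketch. It assumes arcs are six-tuples as in the arc relation; the type names `syn`, `word` and `phoneme`, the contents of `TYPE_HIERARCHY` and the sample arcs are illustrative assumptions rather than data from the corpus.

```python
def reachable(arcs):
    """Map each node to the set of nodes it structurally precedes (reflexive)."""
    nodes = {n for a in arcs for n in (a[1], a[2])}
    succ = {n: set() for n in nodes}
    for a in arcs:
        succ[a[1]].add(a[2])
    reach = {}
    for n in nodes:
        seen, todo = {n}, [n]
        while todo:
            for m in succ[todo.pop()]:
                if m not in seen:
                    seen.add(m)
                    todo.append(m)
        reach[n] = seen
    return reach


# Assumed, domain-specific ordering on type names (direct pairs only).
TYPE_HIERARCHY = {("syn", "word"), ("word", "phoneme")}


def dom(arcs):
    """Pairs (i, j) of arc ids where arc i dominates arc j."""
    reach = reachable(arcs)
    return {(i[0], j[0])
            for i in arcs for j in arcs
            if (i[3], j[3]) in TYPE_HIERARCHY
            # structural inclusion: i starts no later and ends no earlier
            and j[1] in reach[i[1]] and i[2] in reach[j[2]]}


# A verb arc and a word arc spanning the same two nodes, plus phonemes.
sample = [
    (1, 3, 6, "syn", "v", None),
    (2, 3, 6, "word", "had", None),
    (3, 3, 4, "phoneme", "hv", None),
    (4, 4, 5, "phoneme", "ae", None),
    (5, 5, 6, "phoneme", "d", None),
]
```

With these arcs the syntax arc dominates the word arc but not the reverse, even though the two span the same nodes. As in the datalog formulation, the type ordering is consulted directly and is not closed under transitivity, so the syntax arc does not dominate the phoneme arcs here.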

In some cases it is necessary to have an intran- 
sitive dominance relation that is sensitive to phrase 
structure rules. For simplicity of presentation, we 
assume binary branching structures. The first of 
the rules below states that a sentence arc s will 
immediately and exhaustively dominate an np arc 
followed by a vp arc. 

ps_rule(s, np, vp).
ps_rule(np, det, n).
ps_rule(vp, v, np).

Now we define immediate dominance over the
syntax arcs syn as follows:

% J is the first child of I
i_dom(I, J) :-
    arc(I, X, Z, syn, P, _),
    ps_rule(P, C1, C2),
    arc(J, X, Y, syn, C1, _),
    arc(_, Y, Z, syn, C2, _).

% J is the second child of I
i_dom(I, J) :-
    arc(I, X, Z, syn, P, _),
    ps_rule(P, C1, C2),
    arc(_, X, Y, syn, C1, _),
    arc(J, Y, Z, syn, C2, _).

Another widely used relation between arcs is
association. In the instance of the AG model in
Figure 3, association amounts to sharing the value
of $L_3$, as we saw in the tuples for dark and H*
in Figure 4. The assoc predicate simply does a
join on the third label field:

assoc(I, J) :-
    arc(I, _, _, _, _, A),
    arc(J, _, _, _, _, A).

Finally, it is convenient to have a Kleene star
relation. Unfortunately in datalog we are unable to
collect up the arbitrary length sequence it matches.
Here we have it returning the two nodes which
bound the sequence, which is often enough to
uniquely identify the sequence in practice.

node(N) :- arc(_, N, _, _, _, _).
node(N) :- arc(_, _, N, _, _, _).

% closure over arcs sharing the first label
kleene1(X, X, _) :- node(X).
kleene1(X, Y, L) :-
    arc(_, X, Z, L, _, _),
    kleene1(Z, Y, L).

% closure over arcs sharing the second label
kleene2(X, X, _) :- node(X).
kleene2(X, Y, L) :-
    arc(_, X, Z, _, L, _),
    kleene2(Z, Y, L).

% closure over arcs sharing the third label
kleene3(X, X, _) :- node(X).
kleene3(X, Y, L) :-
    arc(_, X, Z, _, _, L),
    kleene3(Z, Y, L).
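
Outside datalog, the sequence matched by a Kleene star can be collected as well as its bounding nodes. The Python sketch below is one hypothetical way to do this; it assumes six-tuple arcs, an acyclic graph (so that the search terminates), and a `pos` argument giving the tuple position of the label to match (3, 4 or 5).

```python
def kleene_paths(arcs, start, pos, label):
    """Yield (end_node, [arc ids]) for every chain of zero or more arcs
    leaving `start` whose label at tuple position `pos` equals `label`.

    arcs: list of (arc_id, source, target, l1, l2, l3) tuples.
    Assumes the graph is acyclic, as annotation graphs are.
    """
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        yield node, path
        for a in arcs:
            if a[1] == node and a[pos] == label:
                stack.append((a[2], path + [a[0]]))


# Two syllable arcs followed by an arc of another type (made-up data).
chain = [
    (1, 0, 1, "syl", "a", None),
    (2, 1, 2, "syl", "b", None),
    (3, 2, 3, "W", "x", None),
]
matches = list(kleene_paths(chain, 0, 3, "syl"))
```

Each result pairs the end node, which corresponds to what kleene1 returns, with the list of arcs traversed, which is the part that the datalog version cannot collect.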

With this simple machinery we can start defin- 
ing some annotation queries. 
Find a sequence of vowel followed by stop at the 
phoneme level (assumes suitably defined vowel 
and stop unary relations): 

vowel_stop(I, J) :-
    arc(I, _, Y, phoneme, V, _),
    arc(J, Y, _, phoneme, S, _),
    vowel(V), stop(S).

If we do not want both the vowel and the stop, but
just the vowel, we could write:

vowel_stop(I) :-
    arc(I, _, Y, phoneme, V, _),
    arc(_, Y, _, phoneme, S, _),
    vowel(V), stop(S).
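
The vowel-stop query is a self-join of the arc relation on a shared node. As an informal cross-check, it can be written in Python as below; the `VOWELS` and `STOPS` sets stand in for the unary vowel and stop relations and are assumptions of this example, as are the sample arcs.

```python
# Stand-ins for the unary relations vowel/1 and stop/1 (assumed contents).
VOWELS = {"iy", "ae", "aa", "ux"}
STOPS = {"dcl", "d", "kcl", "k"}


def vowel_stop(arcs):
    """Pairs (i, j) of arc ids: a vowel arc i immediately followed by a stop arc j.

    arcs: list of (arc_id, source, target, l1, l2, l3) tuples.
    """
    return [(i[0], j[0])
            for i in arcs for j in arcs
            if i[3] == "phoneme" and j[3] == "phoneme"
            and i[2] == j[1]              # target of i is the source of j
            and i[4] in VOWELS and j[4] in STOPS]


# Made-up phoneme arcs for one word.
phoneme_arcs = [
    (1, 0, 1, "phoneme", "hv", None),
    (2, 1, 2, "phoneme", "ae", None),
    (3, 2, 3, "phoneme", "dcl", None),
    (4, 3, 4, "phoneme", "d", None),
]
pairs = vowel_stop(phoneme_arcs)
```

Only the vowel arc and the stop arc that starts at its end node are paired; the second stop follows another stop and so does not match.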

Find words dominating vowel phonemes: 

strongWrdDomVowels(I) :-
    arc(I, _, _, word, s, _),
    arc(J, _, _, phoneme, V, _),
    vowel(V),
    dom(I, J).

Find words associated with H* tones:

sylHtone(I) :-
    arc(I, _, _, word, _, A),
    arc(_, _, _, tone, 'h*', A).

Find stop-vowel sequences dominated by words in
noun phrases where the word is associated with an
H* tone target.

stop_vowel_seq(I, J) :-
    arc(I, _, Y, phoneme, S, _), stop(S),
    arc(J, Y, _, phoneme, V, _), vowel(V),
    arc(W, _, _, word, _, _),
    arc(N, _, _, syn, np, _),
    dom(N, W), dom(W, I), dom(W, J),
    arc(T, _, _, tone, 'h*', _), assoc(W, T).



Find the intermediate phrase containing the main 
verb of a sentence: 

imt_phrase(P) :-
    arc(K, _, _, syn, s, _),
    arc(J, _, _, syn, vp, _),
    arc(I, _, _, syn, v, _),
    i_dom(K, J), i_dom(J, I),
    dom(P, I),
    arc(P, _, _, imt, _, _).

Return the set of syllables between an H* and an
L% tone (inclusive).

syls(K) :-
    arc(_, _, N, tone, 'h*', A1),
    arc(_, N, _, tone, 'l%', A2),
    arc(I, _, N1, syl, _, A1),
    arc(J, N2, _, syl, _, A2),
    kleene1(N1, N2, syl),
    arc(K, N2, N3, syl, _, _),
    kleene1(N3, N4, syl).

The above query shows how the datalog model 
breaks down. We would like it to return sets of 
sets of syllable arcs. Instead it returns a flat set 
structure. In many cases we will know that some 
arc participating in the query expression can be 
used to recover the nested structure. For example, 
if the head of the above clause was changed from 
syls (K) to syls (I, K) , then I will aggregate 
K in just the right way. 
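
The recovery of nested structure from a flat result can be illustrated outside datalog. Given the pairs that a two-argument syls(I, K) would return, grouping on the first component yields the desired set of sets. The sketch below uses made-up arc ids, and the function name `nest` is hypothetical.

```python
from collections import defaultdict


def nest(pairs):
    """Group a flat set of (i, k) pairs into {i: {k, ...}}."""
    groups = defaultdict(set)
    for i, k in pairs:
        groups[i].add(k)
    return dict(groups)


# Flat result: arc ids 10 and 11 play the role of I, the rest are syllables K.
flat = [(10, 1), (10, 2), (11, 3)]
nested = nest(flat)
```

This grouping step is exactly what the flat datalog result leaves to the consumer of the query.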

5. Applying XML Query Languages to 
Annotations 

It is worth briefly considering the suitability of
existing XML query languages such as XML-QL
and XQL [12] for the domain of annotated
speech. At first glance the problems we face
querying annotated speech data are similar to
those present with XML queries, in that both
present a hierarchical data model. A number
of formulations of annotation data as XML
are possible; indeed some projects make use
of XML/SGML based formats entirely (e.g.
MATE [mate.nis.sdu.dk], LACITO
[lacito.vjf.cnrs.fr/ARCHIVAG/ENGLISH.htm]).
XML can represent trees using properly nested
tags, in the obvious way. In order to represent
multiple independent hierarchies built on top of
the same material one must construct trees using
IDREF pointers. This idea was proposed by the
Text Encoding Initiative [13] and recently adopted
by the MATE project. We believe this approach is
vastly more expressive than necessary for
representing speech annotations, and we prefer a
more constrained approach having desirable
computational properties with respect to
creation, validation and query.

The XQL proposal describes a query language
which is intended to select elements from within
XML documents according to various criteria; for
example, the query text/author returns all author
elements that are children of text elements. The
XQL data model ignores the order of elements
within a parent element and has no obvious way to
query for sequences of tokens.

The XML-QL proposal provides for a data 
model where the order of elements is respected. 
A query for a word-internal vowel-stop sequence 
could be expressed as follows (assuming suitably 
tagged annotation data for TIMIT): 

<word>
  <phoneme label=&vowel;/>
  <phoneme label=&stop;/>
</word>

The result of this query would have the following 
form: 

<word label=had> 

<phoneme label=ae/> 

<phoneme label=dcl/> 
</word> 

<word label=dark> 

<phoneme label=ar/> 

<phoneme label=k/> 
</word> 

Queries which refer to two independent hier- 
archies, such as syntactic and intonational phrase 
structure, need to use joins. For example, to find 
words that are simultaneously at the end of both 
clauses and intermediate phrases, we could have 
the following query: 

<intermediate>
  <word id=$i></> [end()]
</intermediate>

<clause>
  <word id=$i></> [end()]
</clause>

We assume the existence of some mechanism 
to pick out the last child element. The ID attribute 
ensures that the words are the same in each part of 
the join. 

Perhaps either of these approaches could be 
made to work for a useful range of query needs. 
However they do not appear to be sufficiently 
general. For example, it is often useful to 



have query expressions involving Kleene star:
'select all pairs of consonants, ignoring any 
intervening vowels' (CV*C). Such queries may 
ignore hierarchical structure, finding sequences 
across (say) word boundaries. Using regular 
expressions over paths, XML-QL could provide 
access to strings of terminal symbols ignoring 
intervening levels of hierarchy. Yet it does not 
provide regular-expression matching over those 
sequences. Alternatively, sequences at each level 
of a hierarchy could be chained together using 
IDREF pointers, but it is unclear how we would 
manage closures over such pointer structures. 
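
To show what regular-expression matching over a phoneme sequence involves, the sketch below maps each phoneme to a class symbol and applies the pattern CV*C to the resulting string, using a lookahead so that overlapping pairs are found. The `VOWELS` set and the phoneme list are assumptions made for illustration, and the function name is hypothetical.

```python
import re

# Assumed vowel inventory; every other label is treated as a consonant.
VOWELS = {"iy", "ae", "aa", "ux"}


def consonant_pairs(phonemes):
    """Return pairs of consonants separated only by zero or more vowels."""
    # One class symbol per phoneme, so string offsets equal list indices.
    classes = "".join("V" if p in VOWELS else "C" for p in phonemes)
    pairs = []
    # The lookahead lets a consonant end one match and begin the next.
    for m in re.finditer(r"(?=(CV*C))", classes):
        first, last = m.start(1), m.end(1) - 1
        pairs.append((phonemes[first], phonemes[last]))
    return pairs


# A flat phoneme sequence that ignores word boundaries (made-up example).
result = consonant_pairs(["sh", "iy", "hv", "ae", "dcl", "d", "aa"])
```

Because the match runs over the flat terminal sequence, pairs that straddle a word boundary are found in the same way as word-internal ones.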

6. Conclusions 

Annotated speech corpora are an essential com- 
ponent of speech research, and the variety of for- 
mats in which they are distributed has become a 
barrier to their wider adoption. To address this 
issue, we have developed two data models for 
speech annotations which seem to be sufficiently 
expressive to encompass the full range of practice 
in this area. We have shown how the models can 
be stored in a simple relational format, and how 
many useful queries in this domain are first-order. 
However, existing query languages lack sufficient 
expressive power for the full range of queries we 
would like to be able to express, and we hope to
stimulate new research into the design of
general-purpose query languages for databases of
annotated speech recordings.